Het-node2vec: second order random walk sampling for heterogeneous multigraphs embedding
We introduce a set of algorithms (Het-node2vec) that extend the original
node2vec node-neighborhood sampling method to heterogeneous multigraphs, i.e.
networks characterized by multiple types of nodes and edges. The resulting
random walk samples capture both the structural characteristics of the graph
and the semantics of the different types of nodes and edges. The proposed
algorithms can focus their attention on specific node or edge types, allowing
accurate representations even for underrepresented types of nodes and edges that
are of interest for the prediction problem under investigation. These rich and
well-focused representations can boost unsupervised and supervised learning on
heterogeneous graphs.
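The core mechanism can be illustrated with a toy sketch: a standard node2vec second-order walk (return parameter p, in-out parameter q) augmented with a multiplier that boosts transitions into a chosen node type. The multiplier alpha and its exact placement are illustrative assumptions, not the paper's precise weighting scheme.

```python
import random

def step_weight(prev, cur, nxt, graph, node_type, p, q, focus_type, alpha):
    """Unnormalized node2vec weight for moving cur -> nxt, given prev."""
    if nxt == prev:                      # return to previous node
        w = 1.0 / p
    elif nxt in graph[prev]:             # distance 1 from prev (BFS-like)
        w = 1.0
    else:                                # distance 2 from prev (DFS-like)
        w = 1.0 / q
    if node_type[nxt] == focus_type:     # boost the under-represented type
        w *= alpha
    return w

def random_walk(graph, node_type, start, length, p=1.0, q=1.0,
                focus_type=None, alpha=1.0):
    walk, prev, cur = [start], None, start
    for _ in range(length - 1):
        nbrs = list(graph[cur])
        if not nbrs:
            break
        if prev is None:
            nxt = random.choice(nbrs)
        else:
            weights = [step_weight(prev, cur, n, graph, node_type,
                                   p, q, focus_type, alpha) for n in nbrs]
            nxt = random.choices(nbrs, weights=weights, k=1)[0]
        walk.append(nxt)
        prev, cur = cur, nxt
    return walk

# Toy heterogeneous graph: genes (g*) and a disease node (d1).
graph = {"g1": {"g2", "d1"}, "g2": {"g1", "d1"}, "d1": {"g1", "g2"}}
node_type = {"g1": "gene", "g2": "gene", "d1": "disease"}
print(random_walk(graph, node_type, "g1", 10, p=0.5, q=2.0,
                  focus_type="disease", alpha=4.0))
```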
GraPE: fast and scalable Graph Processing and Embedding
Graph Representation Learning methods have enabled a wide range of learning problems to be addressed for data that can be represented in graph form. Nevertheless, several real-world problems in economics, biology, medicine and other fields raise serious scaling issues for existing methods and their software implementations, owing to the size of real-world graphs, which may comprise millions of nodes and billions of edges. We present GraPE, a software resource for graph processing and random-walk-based embedding that can scale to large, high-degree graphs and significantly speed up computation. GraPE comprises specialized data structures, algorithms, and a fast parallel implementation that delivers several orders of magnitude of improvement in empirical space and time complexity compared to state-of-the-art software resources, with a corresponding boost in the performance of machine learning methods for edge- and node-label prediction and for the unsupervised analysis of graphs. GraPE is designed to run on laptop and desktop computers, as well as on high-performance computing clusters.
GRAPE for fast and scalable graph processing and random-walk-based embedding
Graph representation learning methods opened new avenues for addressing complex, real-world problems represented by graphs. However, many graphs used in these applications comprise millions of nodes and billions of edges and are beyond the capabilities of current methods and software implementations. We present GRAPE (Graph Representation Learning, Prediction and Evaluation), a software resource for graph processing and embedding that is able to scale with big graphs by using specialized and smart data structures, algorithms, and a fast parallel implementation of random-walk-based methods. Compared with state-of-the-art software resources, GRAPE shows an improvement of orders of magnitude in empirical space and time complexity, as well as competitive edge- and node-label prediction performance. GRAPE comprises approximately 1.7 million well-documented lines of Python and Rust code and provides 69 node-embedding methods, 25 inference models, a collection of efficient graph-processing utilities, and over 80,000 graphs from the literature and other sources. Standardized interfaces allow a seamless integration of third-party libraries, while ready-to-use and modular pipelines permit an easy-to-use evaluation of graph-representation-learning methods, therefore also positioning GRAPE as a software resource that performs a fair comparison between methods and libraries for graph processing and embedding.
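As a hypothetical usage sketch, the workflow suggested by the abstract (load an edge list, then run a Rust-backed random-walk embedder) might look as follows. The class and method names are assumptions based on the library's description; consult the GRAPE documentation for the actual API.

```python
# Hypothetical GRAPE usage sketch; names below are assumptions, not the
# confirmed API. See https://github.com/AnacletoLAB/grape for specifics.
from grape import Graph                                # assumed entry point
from grape.embedders import Node2VecSkipGramEnsmallen  # assumed embedder name

# Load an edge list into GRAPE's compact graph representation.
graph = Graph.from_csv(
    edge_path="edges.tsv",            # hypothetical file
    sources_column="subject",
    destinations_column="object",
    directed=False,
)

# Compute node embeddings with a Rust-backed random-walk embedder.
embedding = Node2VecSkipGramEnsmallen().fit_transform(graph)
```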
Supervised learning with word embeddings derived from PubMed captures latent knowledge about protein kinases and cancer.
Inhibiting protein kinases (PKs) that cause cancers has been an important topic in cancer therapy for years. So far, almost 8% of >530 PKs have been targeted by FDA-approved medications, and around 150 protein kinase inhibitors (PKIs) have been tested in clinical trials. We present an approach based on natural language processing and machine learning to investigate the relations between PKs and cancers, predicting PKs whose inhibition would be efficacious to treat a certain cancer. Our approach represents PKs and cancers as semantically meaningful 100-dimensional vectors based on word and concept neighborhoods in PubMed abstracts. We use information about phase I-IV trials in ClinicalTrials.gov to construct a training set for random forest classification. Our results with historical data show that associations between PKs and specific cancers can be predicted years in advance with good accuracy. Our tool can be used to predict the relevance of inhibiting PKs for specific cancers and to support the design of well-focused clinical trials to discover novel PKIs for cancer therapy.
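A minimal sketch of the classification setup described above, assuming the 100-dimensional concept vectors for kinases and cancers are already available; the embeddings and trial-derived labels below are random placeholders, not the paper's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_pairs = 500
# Each example: a kinase vector concatenated with a cancer vector (100 + 100 dims).
X = rng.normal(size=(n_pairs, 200))
# Label: 1 if the pair reached a clinical trial (placeholder labels).
y = rng.integers(0, 2, size=n_pairs)

clf = RandomForestClassifier(n_estimators=300, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())
```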
KG-COVID-19: A Framework to Produce Customized Knowledge Graphs for COVID-19 Response.
Integrated, up-to-date data about SARS-CoV-2 and COVID-19 is crucial for the ongoing response to the COVID-19 pandemic by the biomedical research community. While rich biological knowledge exists for SARS-CoV-2 and related viruses (SARS-CoV, MERS-CoV), integrating this knowledge is difficult and time-consuming, since much of it is in siloed databases or in textual format. Furthermore, the data required by the research community vary drastically for different tasks; the optimal data for a machine learning task, for example, differ substantially from the data used to populate a browsable user interface for clinicians. To address these challenges, we created KG-COVID-19, a flexible framework that ingests and integrates heterogeneous biomedical data to produce knowledge graphs (KGs), and applied it to create a KG for COVID-19 response. This KG framework can also be applied to other problems in which siloed biomedical data must be quickly integrated for different research applications, including future pandemics.
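The ingest-and-integrate pattern the framework describes can be sketched as follows: each source is normalized into a common tabular edge format, and the results are merged into one graph. The file names, columns, and KGX-like schema here are illustrative assumptions.

```python
import csv

def transform_source(rows, source_name):
    """Normalize one source's records into (subject, predicate, object) edges."""
    for r in rows:
        yield {"subject": r["drug_id"], "predicate": "biolink:treats",
               "object": r["disease_id"], "provided_by": source_name}

def merge(edge_iterables, out_path):
    """Concatenate all transformed sources into a single edge TSV."""
    with open(out_path, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=["subject", "predicate",
                                          "object", "provided_by"],
                           delimiter="\t")
        w.writeheader()
        for edges in edge_iterables:
            w.writerows(edges)

# Illustrative record: imatinib (CHEBI:45783) -> COVID-19 (MONDO:0100096).
src = [{"drug_id": "CHEBI:45783", "disease_id": "MONDO:0100096"}]
merge([transform_source(src, "example_source")], "merged_edges.tsv")
```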
KG-Hub: building and exchanging biological knowledge graphs.
MOTIVATION: Knowledge graphs (KGs) are a powerful approach for integrating heterogeneous data and making inferences in biology and many other domains, but a coherent solution for constructing, exchanging, and facilitating the downstream use of KGs is lacking.
RESULTS: Here we present KG-Hub, a platform that enables standardized construction, exchange, and reuse of KGs. Features include a simple, modular extract-transform-load pattern for producing graphs compliant with Biolink Model (a high-level data model for standardizing biological data), easy integration of any OBO (Open Biological and Biomedical Ontologies) ontology, cached downloads of upstream data sources, versioned and automatically updated builds with stable URLs, web-browsable storage of KG artifacts on cloud infrastructure, and easy reuse of transformed subgraphs across projects. Current KG-Hub projects span use cases including COVID-19 research, drug repurposing, microbial-environmental interactions, and rare disease research. KG-Hub is equipped with tooling to easily analyze and manipulate KGs. KG-Hub is also tightly integrated with graph machine learning (ML) tools which allow automated graph ML, including node embeddings and training of models for link prediction and node classification.
AVAILABILITY AND IMPLEMENTATION: https://kghub.org
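As a sketch of downstream reuse, a KG-Hub build distributed as KGX-style node and edge TSVs could be loaded for analysis as follows; the file name is an illustrative assumption.

```python
import csv
import networkx as nx

# Assumes a KGX-style edge TSV (columns: subject, predicate, object)
# has already been downloaded from https://kghub.org; the file name
# below is an illustrative assumption.
G = nx.MultiDiGraph()
with open("merged-kg_edges.tsv") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        G.add_edge(row["subject"], row["object"], predicate=row["predicate"])

print(G.number_of_nodes(), G.number_of_edges())
```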
Semantic integration of clinical laboratory tests from electronic health records for deep phenotyping and biomarker discovery.
Electronic Health Record (EHR) systems typically define laboratory test results using the Laboratory Observation Identifier Names and Codes (LOINC) and can transmit them using Fast Healthcare Interoperability Resource (FHIR) standards. LOINC has not yet been semantically integrated with computational resources for phenotype analysis. Here, we provide a method for mapping LOINC-encoded laboratory test results transmitted in FHIR standards to Human Phenotype Ontology (HPO) terms. We annotated the medical implications of 2923 commonly used laboratory tests with HPO terms. Using these annotations, our software assesses laboratory test results and converts each result into an HPO term. We validated our approach with EHR data from 15,681 patients with respiratory complaints and identified known biomarkers for asthma. Finally, we provide a freely available SMART on FHIR application that can be used within EHR systems. Our approach allows readily available laboratory tests in EHR to be reused for deep phenotyping and exploits the hierarchical structure of HPO to integrate distinct tests that have comparable medical interpretations for association studies.
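The conversion step described above amounts to a lookup keyed by LOINC code and coded interpretation; a minimal sketch follows, with two illustrative annotations standing in for the paper's 2923.

```python
# Minimal sketch of the LOINC-to-HPO conversion: a lookup keyed by
# LOINC code and FHIR coded interpretation (H = high, L = low).
LOINC_TO_HPO = {
    ("2345-7", "H"): "HP:0003074",  # serum glucose high -> Hyperglycemia
    ("2345-7", "L"): "HP:0001943",  # serum glucose low  -> Hypoglycemia
}

def observation_to_hpo(loinc_code: str, interpretation: str):
    """Map a FHIR Observation's LOINC code + interpretation to an HPO term."""
    return LOINC_TO_HPO.get((loinc_code, interpretation))

print(observation_to_hpo("2345-7", "H"))  # HP:0003074
```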
The Monarch Initiative in 2019: an integrative data and analytic platform connecting phenotypes to genotypes across species.
In biology and biomedicine, relating phenotypic outcomes with genetic variation and environmental factors remains a challenge: patient phenotypes may not match known diseases, candidate variants may be in genes that haven't been characterized, research organisms may not recapitulate human or veterinary diseases, environmental factors affecting disease outcomes are unknown or undocumented, and many resources must be queried to find potentially significant phenotypic associations. The Monarch Initiative (https://monarchinitiative.org) integrates information on genes, variants, genotypes, phenotypes and diseases in a variety of species, and allows powerful ontology-based search. We develop many widely adopted ontologies that together enable sophisticated computational analysis, mechanistic discovery and diagnostics of Mendelian diseases. Our algorithms and tools are widely used to identify animal models of human disease through phenotypic similarity, for differential diagnostics and to facilitate translational research. Launched in 2015, Monarch has grown with regard to data (new organisms, more sources, better modeling); new APIs and standards; ontologies (the new Mondo unified disease ontology, improvements to ontologies such as HPO and uPheno); user interface (a redesigned website); and community development. Monarch data, algorithms and tools are being used and extended by resources such as GA4GH and NCATS Translator, among others, to aid mechanistic discovery and diagnostics.
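The phenotypic-similarity matching mentioned above can be sketched in its simplest form: expand each phenotype profile to its ontology ancestors and compare the closures with Jaccard similarity. The tiny ontology and HP-style identifiers below are illustrative, and Monarch's production algorithms are more sophisticated (e.g., information-content weighted).

```python
# Toy ancestor table: each term maps to itself plus its ontology ancestors.
ANCESTORS = {
    "HP:A": {"HP:A", "HP:root"},
    "HP:B": {"HP:B", "HP:A", "HP:root"},
    "HP:C": {"HP:C", "HP:root"},
}

def closure(profile):
    """Expand a phenotype profile to the union of its ancestor sets."""
    return set().union(*(ANCESTORS[t] for t in profile))

def jaccard(p1, p2):
    c1, c2 = closure(p1), closure(p2)
    return len(c1 & c2) / len(c1 | c2)

patient = {"HP:B"}
model = {"HP:A", "HP:C"}
print(jaccard(patient, model))  # shared ancestors raise the score (0.5)
```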
Codes on Graphs and Analysis of Iterative Algorithms for Reconstructing Sparse Signals and Decoding of Check-Hybrid GLDPC Codes
The need for fast and efficient algorithms in different fields of communications and signal processing has led to the development of low-complexity iterative algorithms. In compressed sensing and channel coding, the two fields on which this dissertation focuses, designing low-complexity iterative algorithms with excellent performance has been of interest for many years. Recently, there has been significant interest in understanding the failures of iterative reconstruction and decoding algorithms: knowing how an algorithm fails can improve performance, either by guiding the design of new algorithms or by identifying conditions on the algorithm's input under which those failures cannot occur.

In the first part of this dissertation, we consider an iterative reconstruction algorithm called the interval-passing algorithm (IPA), originally introduced to reconstruct non-negative signals from binary measurement matrices. We first modify the IPA to reconstruct signals from non-negative measurement matrices and compare its performance with two other reconstruction algorithms, the verification algorithm and linear programming. The results show that the IPA offers a good trade-off between the very simple verification algorithm and the complex linear programming technique. We also show that the IPA fails on certain subgraphs of the Tanner graph of the measurement matrix, called stopping sets, and analyze its failures and successes on subsets of stopping sets. We provide sufficient conditions under which the IPA succeeds in recovering the signal, and we report the reconstruction performance of the IPA on different LDPC measurement matrices to show the effect of stopping sets.

In the second part of the dissertation, we provide a method for constructing a class of codes called check-hybrid generalized LDPC (CH-GLDPC) codes, in which some single parity checks are replaced by super checks corresponding to shorter, stronger error-correcting codes. The main feature of our method is to place the super checks carefully so that harmful structures in the Tanner graph of the LDPC code, called trapping sets, are eliminated. A second goal is to reduce the rate loss caused by the super checks by finding the minimum number of super checks needed to eliminate a given trapping set. To construct these codes, we first use knowledge of the trapping sets of LDPC codes over the binary symmetric channel (BSC) and systematically replace super checks to disable a trapping set. We then provide upper bounds on the minimum number of super checks needed to eliminate all trapping sets of a certain size in the Tanner graph of an LDPC code. The guaranteed error-correction capability of CH-GLDPC codes is also studied, and the results are extended to different classes of LDPC codes and iterative decoders.
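For concreteness, a simplified sketch of the interval-passing algorithm for the binary-matrix case follows: variable and check nodes exchange interval bounds on the signal coordinates until the intervals stop shrinking. The message schedule and the monotone (intersecting) update used here are textbook-style assumptions, not the dissertation's exact non-negative-matrix generalization.

```python
import numpy as np

def ipa(A, y, max_iter=100):
    """Interval-passing reconstruction of non-negative x from y = A @ x."""
    m, n = A.shape
    Nc = [np.flatnonzero(A[c]) for c in range(m)]     # variables per check
    Nv = [np.flatnonzero(A[:, v]) for v in range(n)]  # checks per variable
    lo = np.zeros((m, n))                             # variable-to-check lower bounds
    hi = np.zeros((m, n))                             # variable-to-check upper bounds
    for v in range(n):
        hi[Nv[v], v] = y[Nv[v]].min()  # x_v <= y_c for every neighboring check c
    for _ in range(max_iter):
        # Check-to-variable: subtract the other neighbors' bounds from y_c.
        clo, chi = np.zeros((m, n)), np.zeros((m, n))
        for c in range(m):
            s_lo, s_hi = lo[c, Nc[c]].sum(), hi[c, Nc[c]].sum()
            for v in Nc[c]:
                clo[c, v] = y[c] - (s_hi - hi[c, v])
                chi[c, v] = y[c] - (s_lo - lo[c, v])
        # Variable-to-check: intersect with the other checks' messages.
        new_lo, new_hi = lo.copy(), hi.copy()
        for v in range(n):
            for c in Nv[v]:
                others = [c2 for c2 in Nv[v] if c2 != c] or [c]
                new_lo[c, v] = max(lo[c, v], *(clo[c2, v] for c2 in others), 0.0)
                new_hi[c, v] = min(hi[c, v], *(chi[c2, v] for c2 in others))
        if np.allclose(new_lo, lo) and np.allclose(new_hi, hi):
            break  # intervals stopped shrinking
        lo, hi = new_lo, new_hi
    # Tightest per-variable interval over incident edges.
    x_lo = np.array([lo[Nv[v], v].max() for v in range(n)])
    x_hi = np.array([hi[Nv[v], v].min() for v in range(n)])
    return x_lo, x_hi

A = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]])
x = np.array([1.0, 2.0, 0.0])
lo, hi = ipa(A, A @ x)
print(lo, hi)  # intervals collapse to the true signal [1, 2, 0]
```

On this toy example the intervals collapse to the true signal; on measurement matrices whose Tanner graphs contain stopping sets, some intervals remain loose, which is exactly the failure mode analyzed above.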
ChIPWig: a state-of-the-art compression method for ChIP-seq data
Poster presentation for the BD2K All Hands Meeting in Washington, DC, November 29-30, 2016. Results were generated as part of the BD2K Targeted Software Development Program award to UIUC.